1 Introduction

Buying and selling houses is a long-standing public concern and a matter of people’s livelihood. Housing prices are an important indicator of economic performance, and both real estate developers and home buyers pay close attention to their fluctuations. In a well-functioning market, housing prices are jointly determined by supply and demand. Because housing now serves both residential and investment purposes, however, research on housing prices is complicated, and predicting long-term prices is a complex and challenging task. Establishing a real estate price prediction system is therefore important for the healthy development of the real estate industry. Rapidly rising housing prices have long burdened middle- and low-income groups and have become a sensitive issue in daily life. Studying the factors that influence real estate prices, and predicting those prices, can help middle- and low-income buyers choose the right time to buy a house and can help government officials understand and regulate price trends. A simple predictive and inferential method for modeling housing prices also helps businesses determine fair prices and allows governments to determine property taxes. This project aims to learn how different factors affect home sale prices by building linear models. Although the data were collected in Melbourne, Australia in 2017, the idea that location and home attributes correlate with housing prices could reasonably apply broadly and internationally.

2 Introduction Re-Write:

Housing prices are an important indicator of the strength of the economy. House price prediction can help real estate developers determine the selling price of a house, help buyers make informed choices about potential purchases, and help property investors identify price trends across different locations. A simple predictive and inferential method for modeling housing prices can therefore be of great significance to the financial market; however, predicting long-term housing prices remains a complex and challenging task. This paper discusses our project on determining how different factors affect home sale prices by building linear models. The data used in this project were collected in Melbourne, Australia in 2017. Melbourne is a large metropolitan city with a strong real estate market, in a region of Australia that experienced a 4.2 percent growth rate in property sales in 2017. We believe the factors that determine housing prices in our model could have broad applications to other locations and countries.

Our project sought to answer the following questions:

  1. Can housing prices in Melbourne, Australia be predicted using this dataset?
  2. Which variables have the greatest impact on housing price?
  3. How do the location, seller, and construction attributes of homes affect the housing market in Melbourne, Australia?

3 Exploratory Data Analysis (EDA)

The following are excerpts and graphs from our exploratory data analysis. This part of the project familiarizes the reader with our dataset’s attributes and lays the foundation for the variables we will include in our linear model. The results of our EDA will also inform the future direction of the project. We completed our EDA with the following goals in mind:

3.1 Goals

  1. Understand which attributes of a home and its sale determine final sale price
  2. Attempt to build a reasonable model for inference and/or prediction for final sale price

3.2 The Melbourne Housing Snapshot Dataset

The independent variables describe each house along three dimensions: (a) type; (b) quality or grade; (c) quantity or area. Before the EDA, the variables in the Melbourne housing dataset can be summarized as follows:

  • Home Sales in 2017
    • Location
    • Construction
    • Sale
  • Variables: 21
    • Numeric: 12
    • Categorical: 9

3.3 The Variables

Rooms: Number of rooms

Price: Price (AUS$)

Method: Method of sale - 5 categories

Type: House, Unit, Townhouse - 3 categories

SellerG: Real Estate Agent - 268 categories

Date: Date sold

Distance: Distance from Central Business District

Regionname: Region name - 8 categories

Propertycount: Number of properties that exist in the suburb

Bedroom2: Number of Bedrooms

Bathroom: Number of Bathrooms

Car: Number of carspots

Landsize: Land Size

BuildingArea: Building Size

YearBuilt: Year home built

CouncilArea: Governing council for the area - 34 categories

Lattitude, Longtitude: GPS location

Suburb: Suburb name - 314 categories

3.4 Summary of Price Statistics

Mean: $1,075,684

SD: $639,310.72

data.full$Price
  Min         85000
  Q1         650000
  Median     903000
  Mean      1075684
  Q3        1330000
  Max       9000000
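A summary of this kind can be reproduced with a few lines of Python. The sketch below is illustrative only: the function name `price_summary` is hypothetical, and the toy sample simply reuses the five quantile values printed above rather than the full dataset, so its mean and SD will not match the reported ones. The `method="inclusive"` option is chosen because it interpolates linearly between order statistics.

```python
import statistics

def price_summary(prices):
    """Return min, quartiles, mean, max and sample SD for a list of prices."""
    ordered = sorted(prices)
    # "inclusive" interpolates linearly between the ordered observations.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "Min": ordered[0],
        "Q1": q1,
        "Median": median,
        "Mean": statistics.mean(ordered),
        "Q3": q3,
        "Max": ordered[-1],
        "SD": statistics.stdev(ordered),
    }

# Toy sample (NOT the real data): the five quantile values listed above.
toy_prices = [85000, 650000, 903000, 1330000, 9000000]
summary = price_summary(toy_prices)
for label, value in summary.items():
    print(f"{label:<7}{value:>14,.0f}")
```

A mean well above the median, as in the reported statistics, is a first hint that the price distribution is right-skewed.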

3.5 Select Data Pairs

3.6 Correlations

Because the dataset has a large number of feature columns, it is difficult to grasp as a whole. Before further feature mining, we therefore examine which variables are highly correlated with Price, focusing mainly on the numeric variables. The result is as follows:

From the figure above, we conclude that price has a positive relationship with the number of rooms and a negative relationship with distance. There are also strong correlations between some predictors, such as Rooms and Bedroom2, Rooms and Bathroom, and Bedroom2 and Bathroom, which may cause multicollinearity. We therefore need to pay attention to feature selection.
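The screening described above can be sketched in plain Python. The column values below are hypothetical toy numbers, not rows from the dataset, and the 0.8 cutoff for flagging a pair is an assumption chosen for illustration.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy columns using the dataset's variable names.
toy = {
    "Rooms":    [2, 3, 3, 4, 5, 2],
    "Bedroom2": [2, 3, 3, 4, 4, 2],
    "Distance": [3.0, 8.5, 12.0, 6.0, 15.0, 2.5],
}
toy_price = [900000, 1000000, 850000, 1400000, 1300000, 950000]

# Correlation of each predictor with the response.
price_corr = {name: pearson(col, toy_price) for name, col in toy.items()}

# Predictor pairs whose absolute correlation exceeds the assumed cutoff.
CUTOFF = 0.8
collinear_pairs = [
    (a, b) for a, b in combinations(toy, 2)
    if abs(pearson(toy[a], toy[b])) > CUTOFF
]
print(price_corr)
print(collinear_pairs)
```

Pairs that end up in `collinear_pairs` are candidates for closer inspection before modelling, since including both members can inflate coefficient variances.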

3.7 Selling Price

To forecast the sale price more reasonably, we first analyze its distribution. Price is the response variable. Its distribution is shown in the following graph:

A Q-Q plot pairs the sample quantiles of the data with the corresponding theoretical quantiles of a normal distribution in a scatter plot. If the data follow a normal distribution, the points should fall close to a straight line; otherwise, the data are not normally distributed. The Q-Q plot of selling price shows that it does not follow a normal distribution, so a transformation is needed to bring the data closer to normality. We therefore apply a log transformation to selling price.

3.8 Log Selling Price

In the Q-Q plot of log selling price, the points fall close to a straight line. Thus, log selling price is approximately normally distributed.
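The Q-Q comparison of raw and log prices can be illustrated numerically. The sketch below builds the Q-Q pairs by hand and summarizes how straight the plot is with the correlation between sample and theoretical quantiles. The plotting positions `(i + 0.5) / n`, the function names, and the toy prices are all assumptions made for illustration; the toy prices are constructed to be right-skewed and are not drawn from the dataset.

```python
import math
from statistics import NormalDist

def normal_quantiles(n):
    """Theoretical standard-normal quantiles at positions (i + 0.5) / n."""
    return [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

def qq_straightness(values):
    """Correlation between sorted sample values and normal quantiles.

    Values near 1.0 mean the Q-Q plot is close to a straight line.
    """
    sample = sorted(values)
    theory = normal_quantiles(len(sample))
    mean_s = sum(sample) / len(sample)
    mean_t = sum(theory) / len(theory)
    s_st = sum((s - mean_s) * (t - mean_t) for s, t in zip(sample, theory))
    s_ss = sum((s - mean_s) ** 2 for s in sample)
    s_tt = sum((t - mean_t) ** 2 for t in theory)
    return s_st / math.sqrt(s_ss * s_tt)

# Hypothetical right-skewed prices whose logarithms are normal quantiles.
toy_prices = [math.exp(13.7 + 0.6 * z) for z in normal_quantiles(200)]

raw_fit = qq_straightness(toy_prices)
log_fit = qq_straightness([math.log(p) for p in toy_prices])
print(f"raw: {raw_fit:.3f}, log: {log_fit:.3f}")
```

For skewed data of this kind, the log-scale value is closer to 1 than the raw-scale value, which mirrors the visual impression from the two Q-Q plots.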

3.9 Map of Melbourne Sales

The figure shows that sales are mainly concentrated in the Eastern Metropolitan, Southern Metropolitan, Northern Metropolitan, Western Metropolitan, and South-Eastern Metropolitan regions. Fluctuations in housing prices will therefore greatly affect these areas, which account for about 5/6 of Melbourne.

The correlation analysis indicates that price is related to region; in other words, sale prices differ across regions. Price is also related to the number of rooms: the more rooms a home has, the higher its price tends to be. Finally, price is related to the type of home. The results are shown below:

3.10 Price by Region

3.11 Price by Number of Rooms (<10 Rooms)

3.12 Price by Type of Home

3.13 Test of Independence by Group (Pearson \(\chi^2\))

3.13.1 Type, Rooms, Regionname, SellerG

\(H_0\): All means equal by group

\(H_0\) is rejected for all four variables, with p-value \(<2\times 10^{-16}\)
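The exact test call is not shown here. As one way to examine the stated null hypothesis of equal group means, the sketch below computes a one-way ANOVA F statistic in Python. This is a swapped-in illustration and not necessarily the test used above; the function name and the toy prices by Type are hypothetical.

```python
def one_way_f(groups):
    """One-way ANOVA F statistic for a list of groups of numeric values."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        mean_g = sum(g) / len(g)
        # Variation of the group mean around the grand mean.
        ss_between += len(g) * (mean_g - grand_mean) ** 2
        # Variation of observations around their own group mean.
        ss_within += sum((v - mean_g) ** 2 for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical toy prices grouped by Type (not taken from the dataset).
toy_by_type = {
    "House":     [1200000, 1350000, 1100000],
    "Townhouse": [900000, 950000, 870000],
    "Unit":      [600000, 640000, 580000],
}
f_stat = one_way_f(list(toy_by_type.values()))
print(f"F = {f_stat:.1f}")
```

A large F means that variation between group means dominates variation within groups; converting F to a p-value requires the F distribution with the two degrees of freedom computed above.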

3.14 Price by Region and Type

The figure below shows price by region and type of home.

3.15 Transform Data - Homogeneity

3.16 Transform Data - Normality

4 Linear Modelling

4.1 First Attempt at Linear Model

In this first model, we set the regression model as: Price ~ Rooms + Landsize + Distance + Bedroom2 + Bathroom + Car + BuildingArea + Lattitude + Longtitude + Propertycount + factor(Regionname). To check whether this model is acceptable, we compute the variance inflation factor (VIF) of each term. The VIF results show that Bedroom2 has the highest VIF, so it is not suitable for the house price regression. The second model therefore drops Bedroom2.
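For reference, the VIF of predictor \(j\) is \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the other predictors. The sketch below computes it for numeric predictors through an equivalent route, the diagonal of the inverse correlation matrix. It is a minimal illustration under assumptions: the helper names are hypothetical, the toy columns are not rows from the dataset, and the simple Gauss-Jordan inversion assumes that no predictor is an exact linear combination of the others.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def invert(matrix):
    """Gauss-Jordan inverse; adequate for a positive-definite matrix."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(predictors):
    """Map each predictor name to its variance inflation factor."""
    names = list(predictors)
    corr = [[pearson(predictors[a], predictors[b]) for b in names]
            for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}

# Hypothetical toy columns using the dataset's variable names.
toy = {
    "Rooms":    [2, 3, 3, 4, 5, 2],
    "Bedroom2": [2, 3, 3, 4, 4, 2],
    "Distance": [3.0, 8.5, 12.0, 6.0, 15.0, 2.5],
}
vifs = vif(toy)
print({name: round(value, 1) for name, value in vifs.items()})
```

In the toy data, Rooms and Bedroom2 move almost together, so both receive large VIF values, which is the pattern that motivates dropping one of them.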

4.2 Linear Model 2: Removed the Variable with Highest VIF

In this model, we set the regression model as: Price ~ Rooms + Landsize + Distance + Bathroom + Car + BuildingArea + Lattitude + Longtitude + Propertycount + factor(Regionname). As with Model 1, the VIF results are shown as follows:

4.3 Model Coefficients

In this model, factor(Regionname) Western Metropolitan is the term with the highest VIF value. The model coefficients are shown as follows:

Landsize has the highest p-value, which is larger than 0.05. Thus, this variable should be dropped.

4.4 Linear Model 3: Considered Interactions

In this model, we analyze the interaction between Rooms and Regionname. The result is shown as follows:

The p-values for the model with the interaction indicate that all of the variables are acceptable.
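As a sketch of what a Rooms-by-region interaction means (the exact specification of Model 3 is not shown, so the form and symbols below are assumptions), the model lets the per-room effect differ by region:

\[ \text{Price} = \beta_0 + \beta_1\,\text{Rooms} + \sum_{r} \gamma_r\,\mathbb{1}[\text{Regionname}=r] + \sum_{r} \delta_r\,\text{Rooms}\cdot\mathbb{1}[\text{Regionname}=r] + \dots + \varepsilon \]

Here \(r\) runs over the non-reference regions, \(\gamma_r\) shifts the intercept for region \(r\), and \(\delta_r\) changes the slope on Rooms, so the effect of one additional room in region \(r\) is \(\beta_1+\delta_r\) rather than a single \(\beta_1\) shared by all regions.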

4.5 Linear Model 4: Removed Land Size

Following the results of Model 2, Landsize is dropped. In this model, the regression is set as: Price ~ Rooms + Distance + Bathroom + Car + BuildingArea + Lattitude + Longtitude + Propertycount + factor(Regionname).

In the VIF results for Model 4, factor(Regionname) Western Metropolitan is the only term whose VIF exceeds 5, which suggests that it should be dropped.

4.6 Model 4 Coefficients

Residual Analysis

4.7 Homogeneity? No

4.8 Normal? Not Quite

4.9 Influence? Yes

4.10 Remove Influence Points

5 Proposed Model

The final proposed model and its coefficients are as follows:
Observations         4955 (4548 missing obs. deleted)
Dependent variable   Price
Type                 OLS linear regression

F(15,4939)           425.77
R²                   0.56
Adj. R²              0.56

Term                                          Est.            S.E.          t val.   p
(Intercept)                                   -129445309.64   17287919.00   -7.49    0.00
Rooms                                         255713.04       9273.23       27.58    0.00
Distance                                      -44501.15       1467.41       -30.33   0.00
Bathroom                                      115532.73       11453.40      10.09    0.00
Car                                           45801.33        7532.08       6.08     0.00
BuildingArea                                  1794.92         90.67         19.80    0.00
Lattitude                                     -757249.47      124821.36     -6.07    0.00
Longtitude                                    696894.60       116542.04     5.98     0.00
Propertycount                                 -3.69           1.52          -2.42    0.02
factor(Regionname)Eastern Victoria            188304.28       103229.34     1.82     0.07
factor(Regionname)Northern Metropolitan       -55889.99       30219.79      -1.85    0.06
factor(Regionname)Northern Victoria           598550.10       116506.82     5.14     0.00
factor(Regionname)South-Eastern Metropolitan  169831.25       51506.18      3.30     0.00
factor(Regionname)Southern Metropolitan       212777.72       27309.69      7.79     0.00
factor(Regionname)Western Metropolitan        -86943.29       38731.01      -2.24    0.02
factor(Regionname)Western Victoria            515064.40       135131.01     3.81     0.00

Standard errors: OLS
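To make the coefficients concrete, the sketch below applies them to a single home. The coefficient values are copied from the table above; everything else is an assumption for illustration: the function name, the example home's attribute values, and the treatment of any region missing from the table as the reference level with an offset of zero.

```python
# Coefficients copied from the table above.
COEF = {
    "(Intercept)": -129445309.64,
    "Rooms": 255713.04,
    "Distance": -44501.15,
    "Bathroom": 115532.73,
    "Car": 45801.33,
    "BuildingArea": 1794.92,
    "Lattitude": -757249.47,
    "Longtitude": 696894.60,
    "Propertycount": -3.69,
}
REGION_COEF = {
    "Eastern Victoria": 188304.28,
    "Northern Metropolitan": -55889.99,
    "Northern Victoria": 598550.10,
    "South-Eastern Metropolitan": 169831.25,
    "Southern Metropolitan": 212777.72,
    "Western Metropolitan": -86943.29,
    "Western Victoria": 515064.40,
}

def predict_price(home):
    """Linear predictor: intercept + sum(coefficient * value) + region offset."""
    total = COEF["(Intercept)"]
    for name, beta in COEF.items():
        if name != "(Intercept)":
            total += beta * home[name]
    # Assumption: a region not listed in the table is the reference level.
    total += REGION_COEF.get(home["Regionname"], 0.0)
    return total

# Hypothetical home; these attribute values are invented for illustration.
example_home = {
    "Rooms": 3, "Distance": 10, "Bathroom": 1, "Car": 1,
    "BuildingArea": 150, "Lattitude": -37.8, "Longtitude": 145.0,
    "Propertycount": 7000, "Regionname": "Southern Metropolitan",
}
predicted = predict_price(example_home)
print(f"Predicted price: {predicted:,.0f}")
```

Because the intercept and the Lattitude and Longtitude coefficients are large and nearly cancel, small changes in the coordinates move the prediction substantially, so inputs should stay within the range covered by the data.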

5.1 Testing \(R^2\)

\[ R^2 = 1 - \dfrac{RSS}{TSS} = 0.441 \]
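The formula can be written out directly in Python. The sketch below is a generic illustration with hypothetical function names and toy numbers; it does not reproduce the data behind the value reported above. It also includes the usual adjusted \(R^2\) formula, evaluated with the observation count and number of terms from the proposed model's summary.

```python
def r_squared(observed, predicted):
    """R^2 = 1 - RSS / TSS for paired observed and predicted values."""
    mean_obs = sum(observed) / len(observed)
    rss = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    tss = sum((y - mean_obs) ** 2 for y in observed)
    return 1 - rss / tss

def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R^2, penalizing the number of estimated slope terms."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Toy values, invented for illustration.
toy_observed = [1.0, 2.0, 3.0, 4.0]
toy_predicted = [1.1, 1.9, 3.2, 3.8]
toy_r2 = r_squared(toy_observed, toy_predicted)

# Counts taken from the model summary: 4955 observations, 15 terms.
adj = adjusted_r_squared(0.56, 4955, 15)
print(f"toy R^2 = {toy_r2:.3f}, adjusted R^2 = {adj:.2f}")
```

With several thousand observations and only 15 terms, the adjustment is small, which is consistent with the summary reporting the same rounded value for \(R^2\) and adjusted \(R^2\).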

6 Conclusion

The analysis above did not fully deal with outliers, and handling them may further improve the results. Through the analysis of this dataset, we practiced linear regression, and the final model performed reasonably well. Directions for future work include further exploring the log transformation, considering a GLM with a log link, handling factors with many levels (hundreds), dealing with missing data, and improving prediction.

6.1 Future Work

  • Further explore log transformation
  • Consider GLM with log link
  • Handle factors with many levels (hundreds)
  • Handle missing data
  • Improve prediction
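For context on the first two items, a log-transformed linear model and a GLM with a log link are related but not identical. In generic notation (the symbols \(x_i\) and \(\beta\) are illustrative and not taken from the fitted models):

\[ \text{log-transformed OLS: } E[\log(\text{Price}_i)] = x_i^\top\beta \qquad\qquad \text{GLM with log link: } \log\big(E[\text{Price}_i]\big) = x_i^\top\beta \]

The first models the mean of log price, while the second models the log of mean price, so under the GLM the fitted mean on the original price scale is obtained directly as \(\exp(x_i^\top\beta)\).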

7 8 Bibliography

Dataset available: https://www.kaggle.com/dansbecker/melbourne-housing-snapshot

Thorne, S. (2019, November 3). How the Australian Property Market Performed in 2017. Retrieved from www.openagent.com.au/blog/how-the-australian-property-market-performed-in-2017#.

Mansfield, E. R., & Helms, B. P. (1982). Detecting multicollinearity. The American Statistician, 36(3a), 158-160.

Daoud, J. I. (2017, December). Multicollinearity and regression analysis. In Journal of Physics: Conference Series (Vol. 949, No. 1, p. 012009). IOP Publishing.